Project 1

Introduction to Machine Learning

Authors: Maja Andrzejczuk & Julia Przybytniowska

Introduction

Our goal was to predict whether a person's income exceeds $50,000 per year, based on US census data.

Features

Firstly, let's look at the first few rows of the table.

We can see that there is no missing data marked as 'NA'. Let's look for concealed missing values.

Such values are "?" for categorical variables and "-100000" for numerical variables.

Numerical variables have no missing data.

Let's change these cover-up values to NA for the duration of the analysis.
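The replacement can be sketched as follows (the mini data frame here is illustrative; the real frame is loaded from the census data):

```python
import numpy as np
import pandas as pd

# Illustrative mini-frame; the real data comes from the census file.
df = pd.DataFrame({
    "workclass": ["Private", "?", "Self-emp-not-inc"],
    "occupation": ["Sales", "?", "?"],
    "age": [25, 40, 60],
})

# Replace the concealed "?" markers with real NaN so pandas
# counts them as missing (df.isna(), df.info(), etc.).
df = df.replace("?", np.nan)
print(df.isna().sum())
```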

Now we can see that missing data occur in the workclass, occupation, and native_country columns.

The fraction of rows with missing data:

Since rows with missing data make up only 7.45% of the frame, we decided not to delete them. Instead, we replace each missing value with the mode of its column.
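Mode imputation can be sketched like this (toy frame with NaN already in place of the "?" markers):

```python
import numpy as np
import pandas as pd

# Toy frame; "Private" and "Sales" are the modes of their columns.
df = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "Self-emp-not-inc"],
    "occupation": ["Sales", np.nan, "Sales", "Craft-repair"],
})

# Fill each affected column with its most frequent value (the mode).
for col in ["workclass", "occupation"]:
    df[col] = df[col].fillna(df[col].mode()[0])
```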

EDA

We see that the variable "age" starts at age 17 and ends at age 90.

In addition, we see that at least 75% of the values of "capital_gain" and "capital_loss" are equal to 0.

Distribution of the target variable:

People earning more than 50,000 USD predominate.

Numerical variables

We see that "age" starts at age 17. Seeing the drop and sudden jump between 80 and 90 we can conclude that everyone over 90 was given age = 90.
We see that in the "education_number" variable, the largest number is 9 - HS-grad, those who have completed high school. A small percentage is made up of people in the groups who have not completed high school.
We see that both in capital_loss and capital_gain are mostly zeros with possible outliers.
In the variable "hours_per_week" there is a large peek among people working around 40h per week - from the fact that this is full-time.

Outliers

From the graphs above, we concluded that these variables contain no true anomalies: when we analyse the data as histograms/boxplots, even the extreme values fit the overall distributions, so we do not treat them as outliers.

Correlations

To check the correlation of our target variable with the other variables, we first map it to a number.

Pearson's Correlation

We see that the variable most strongly correlated with income_level is education_num, followed by age, capital_gain, and hours_per_week at a similar level. fnlwgt shows almost no correlation, so we are unlikely to use it when building the model.
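The mapping and the correlation check can be sketched as follows (the label strings "<=50K" / ">50K" are an assumption; the toy values are illustrative):

```python
import pandas as pd

# Toy frame; the assumed target labels are "<=50K" / ">50K".
df = pd.DataFrame({
    "income_level": ["<=50K", ">50K", ">50K", "<=50K"],
    "education_num": [9, 13, 14, 9],
    "fnlwgt": [77516, 83311, 215646, 234721],
})

# Map the target to 0/1 so it can enter a correlation matrix.
df["income_level"] = df["income_level"].map({"<=50K": 0, ">50K": 1})

# Pearson correlations of every numeric column with the target;
# method="spearman" gives the Spearman variant.
corr = df.corr(numeric_only=True)["income_level"].sort_values(ascending=False)
```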

Spearman's Correlation

The correlations between the variables are very similar to the Pearson ones.

Categorical variables

We see that the higher the education, the larger the percentage of people earning above 50,000.
The earnings of people who finished school before high school are all at a very similar level.
Among the racial groups, White respondents have by far the highest share of high earners.
About 40% of men earn above 50,000, while among women the share is about 10%.
The overwhelming majority of records have the United States as their native country.

Validation - EDA

We perform all the steps analogously to the part above.

Numerical variables

All our previous observations also hold on the validation set.

We see that here also the highest correlation with income_level is with education_num. We also see a similar correlation for age, capital_gain and hours_per_week. The assumption of a negligible effect of fnlwgt also holds true.

Categorical variables

Here, too, all our previous observations hold.

Feature engineering

Removal of unnecessary columns

We believe that the following variables are worth ignoring in later work:

Conversion of continuous variables into categorical variables

We decided to make the changes on a copy of the frame so that we can go back if necessary.

Let's start with the education variable:

We then considered it useful to restrict the set of values of the `marital-status` variable:

And to restrict the workclass variable:

We group the variable native_country; due to the large number of countries, we decided to divide them into continents:
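A sketch of the grouping; the map below is hypothetical and partial, while the full map covers every country present in the data:

```python
import pandas as pd

# Partial, illustrative country-to-continent map.
continent_map = {
    "United-States": "North-America",
    "Mexico": "North-America",
    "Germany": "Europe",
    "Poland": "Europe",
    "India": "Asia",
}

df = pd.DataFrame({"native_country": ["United-States", "Poland", "India"]})
df["native_country"] = df["native_country"].map(continent_map)
```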

We also group the hours_per_week column around the value of 40 (full-time hours):

Next is age:
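Both groupings can be sketched with `pd.cut`; the bucket labels and the age boundaries here are illustrative modelling choices, not the report's exact ones:

```python
import pandas as pd

df = pd.DataFrame({"hours_per_week": [20, 40, 60], "age": [18, 35, 70]})

# Split weekly hours around the 40-hour full-time mark.
df["hours_per_week"] = pd.cut(
    df["hours_per_week"],
    bins=[0, 39, 40, 168],
    labels=["part_time", "full_time", "overtime"],
)

# Illustrative age buckets; the exact boundaries are a modelling choice.
df["age"] = pd.cut(
    df["age"],
    bins=[16, 30, 45, 60, 90],
    labels=["young", "middle", "older", "senior"],
)
```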

WOE

We noticed that some categorical variables still had a large number of categories, so we decided to reduce them using WOE (Weight of Evidence).

occupation

Based on the results, we could combine some categories even when there is no obvious logical connection between them, such as 'Machine-op-inspct' and 'Farming-fishing'. We merge them only if the same pattern repeats on both the training and validation sets.

We combine categories with similar WoE:
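A minimal WoE computation looks like this, assuming a 0/1 target where 1 means income above 50,000 (categories with a zero count in either class would need smoothing, omitted here for brevity):

```python
import numpy as np
import pandas as pd

def woe_per_category(df, feature, target):
    """WoE(c) = ln( P(category | target=1) / P(category | target=0) )."""
    events = df.groupby(feature)[target].sum()          # target == 1 counts
    non_events = df.groupby(feature)[target].count() - events
    event_dist = events / events.sum()
    non_event_dist = non_events / non_events.sum()
    return np.log(event_dist / non_event_dist)

# Toy example: occupation vs. the binary income_level.
toy = pd.DataFrame({
    "occupation":   ["A", "A", "A", "A", "B", "B", "B", "B"],
    "income_level": [ 1,   1,   0,   0,   1,   0,   0,   0 ],
})
woe = woe_per_category(toy, "occupation", "income_level")
```

Categories whose WoE values are close are candidates for merging.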

Encoding of categorical variables

We will first deal with columns that have binary values:
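A sketch of both steps on a toy frame: a 0/1 map for a binary column, and one-hot encoding via `pd.get_dummies` for multi-valued columns (the 0/1 assignment direction is an assumption):

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["Male", "Female", "Female"],
    "workclass": ["Private", "Self-emp", "Private"],
})

# Binary column: a simple 0/1 map is enough.
df["sex"] = df["sex"].map({"Female": 0, "Male": 1})

# Multi-valued columns: one-hot encoding.
df = pd.get_dummies(df, columns=["workclass"])
```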

The capital_gain and capital_loss variables are normalised:
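A MinMaxScaler sketch on toy values; note that in a real pipeline the scaler should be fit on the training set only and then applied to the validation set:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "capital_gain": [0, 5000, 99999],
    "capital_loss": [0, 1887, 4356],
})

# Scale both columns into [0, 1].
scaler = MinMaxScaler()
df[["capital_gain", "capital_loss"]] = scaler.fit_transform(
    df[["capital_gain", "capital_loss"]]
)
```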

Let's see how our frame looks now:

Conclusions (up to milestone 1)

We believe that we should dispense with variables:

We noticed that it was useful to reduce the number of categories in some of the variables and to group them, so we did this where possible. We then encoded the columns using one-hot encoding.

Due to the large discrepancy in the values of the capital_gain and capital_loss variables, we concluded that it was best to apply MinMaxScaler to them.

Preliminary modelling

To choose the right model, we will check the accuracy of nine of them.
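The comparison loop can be sketched as below; a synthetic dataset stands in for the prepared census frame, and only four of the nine models are shown to keep the example short:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared census frame.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "KNeighbors": KNeighborsClassifier(),
    "RandomForest": RandomForestClassifier(random_state=0),
}

# Fit each model with default parameters and record validation accuracy.
accuracy = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    accuracy[name] = accuracy_score(y_va, model.predict(X_va))
```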

1. Logistic Regression

Feature Importance

2. Decision Tree

3. KNeighbors

4. Random Forest

5. Gaussian NB

6. SVC (Support Vector Machine)

7. XGBoost

8. LightGBM

9. Ada Boost

Comparison:

We will now select the four models with the highest accuracy and combine them into ensemble models.

10. Voting - hard

11. Voting - soft

12. Stacking
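The three ensembles can be sketched with scikit-learn as follows; the three base estimators here are illustrative stand-ins for the top-accuracy models selected above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative base estimators; in the report these are the best models.
base = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Hard voting: majority class label wins.
hard_vote = VotingClassifier(estimators=base, voting="hard").fit(X, y)
# Soft voting: averages predicted probabilities.
soft_vote = VotingClassifier(estimators=base, voting="soft").fit(X, y)
# Stacking: a meta-model learns from the base models' predictions.
stack = StackingClassifier(
    estimators=base, final_estimator=LogisticRegression(max_iter=1000)
).fit(X, y)
```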

Comparison of ensemble models:

After a preliminary analysis of the models and their accuracy with default parameters, the best turned out to be:

Considering the execution time, memory usage, and operational complexity of each model, from here on we will focus only on the LightGBM model.

Hyperparameter optimisation

Tuning LightGBM

XAI (Explainable artificial intelligence)
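SHAP values are a common choice for explaining a boosted model; as a lighter, model-agnostic sketch of the same idea, permutation importance measures how much the score drops when each feature is shuffled (the model and data below are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Illustrative model and data.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature in turn and record the drop in accuracy:
# a large drop means the model relies heavily on that feature.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
importances = result.importances_mean
```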

Summary